Machine Learning Model Application and Comparison in Actuated Traffic Signal Forecasting

Traffic signal forecasting plays a significant role in intelligent traffic systems because it can predict upcoming traffic signals without relying on traditional radio-based direct communication with infrastructure, which poses a high communication security risk. Previously, mathematical and statistical approaches were adopted to predict fixed-time traffic signals, but they are no longer suitable for modern traffic-actuated control systems, where signals depend on dynamic requests from traffic flows. As large amounts of data have become available, machine learning methods are attracting increasing attention. This paper views signal forecasting as a time-series problem. First, a large amount of real data is collected by detectors installed at an intersection in Hanover via IoT communication among infrastructure. Then, a Baseline Model, Dense Model, Linear Model, Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) model are trained on one-day data and the results are compared. Finally, the LSTM is selected for further training with one-month data, producing a test accuracy of over 95% and a median deviation of only 2 s. Moreover, the LSTM is further evaluated as a binary classifier, yielding a classification accuracy of over 92% and an AUC close to 1.


Introduction
With the rapid development of modern traffic, traffic flows have grown explosively. As a result, stop-and-go driving at intersections causes more and more air pollution and wasted energy. To mitigate these problems, smart applications have been developed, e.g., Green Light Optimal Speed Advisory (see [1]), which helps vehicles avoid unnecessary stops at signalized intersections. The precondition for all these applications is that the signals are known in advance. Normally, this information is obtained from Signal Phase and Timing messages broadcast by road side units in modern Cooperative Intelligent Transport Systems. Figure 1 shows a basic communication structure for V2I (Vehicle to Infrastructure) based on IoT (Internet of Things). The field device consists of a TLC (Traffic Light Controller) connected to an RSU (Road Side Unit), which first broadcasts the current MAP (map as intersection geometry) and SPAT (Signal Phase and Timing) messages. Then, the traffic center delivers application data from the TLC to the public transport strategy computer, which finally generates the calculated future MAP and SPAT messages. Therefore, both the current and future SPAT messages can be received by an OBU (On-Board Unit) installed in a vehicle for further processing. However, more and more traffic flows, such as autonomous vehicles and public transport, compete to request the signal messages, and future traffic signals can be affected by sensors detecting vehicles waiting in line. On the one hand, the priority of public transport cannot be fully guaranteed in such cases. On the other hand, this increases both the risk of radio-based communication and the cost of the large number of communication modules installed in intelligent transport systems. Therefore, methods that predict future traffic signals without heavy direct communication with infrastructure are being explored.
Previously, upcoming traffic signals were mainly forecast with mathematical and statistical approaches. Wang et al. [2] used a Kalman Filter to predict the traffic state. Menig et al. [3] adopted Markov chains to calculate the probabilities of occurrence of several signal states. However, these approaches achieve unsatisfactory accuracy and transferability for actuated traffic systems, in which traffic signal changes depend on the requests from different traffic flows. Later, as the pool of data collected by different detectors grew rapidly, machine learning models attracted more attention [4]. Weisheit and Hoyer [5] applied Support Vector Machines to predict possible future traffic states, where the states were divided into different possible groups for classification. Heckmann et al. [6] further defined stages to group related signal states, which makes it possible to forecast three states in advance. The authors viewed signal prediction as a regression problem and compared the performance of different combinations of Extreme-Gradient-Boosting and Bayesian Networks (see [7]). However, these works have to assume that the traffic cycle time is fixed, which does not apply to actuated traffic signals. Another research perspective is to view signal prediction as a time-series forecasting problem [8,9]. Khosravi et al. [10] used machine learning to predict time-series wind speed data of a wind farm in Brazil. The researchers compared Adaptive Neuro-Fuzzy Inference System and hybrid models, Multilayer Feed-Forward Neural Network, Support Vector Regression, Fuzzy Inference System, and Group Method of Data Handling type neural network, which suggested that traffic signals could also be treated as time-series data. Genser et al. [11] made efforts to standardize Signal Phase and Timing messages to forecast the residual time of each phase. They applied a Random Survival Forest model to forecast the time to green and compared it with the baseline models of Auto-Regressive Integrated Moving Average and Linear Regression. They also mentioned the high potential of the Long Short-Term Memory (LSTM) model for such time-series problems. Zhou et al. [12] proposed the Informer to solve the problem of long sequence time-series forecasting. It is a modified Transformer with increased prediction capacity, and it was successfully applied to predict electricity consumption over a long period. Tang et al. [13] rethought one-dimensional convolutional neural networks (1D-CNNs) from the omni-scale for time-series classification tasks and provided a stronger baseline. Therefore, this research explores machine learning methods that predict future traffic signals as time-series data. The LSTM, Baseline Model, Linear Model, Dense Model, and Convolutional Neural Network are applied and compared for traffic signal forecasting in this work.
The rest of this paper is organized as follows. Section 2 introduces how the collected data are processed, as well as the basic structure of the machine learning models studied. Section 3 describes the forecasting results and further analyzes the test accuracy and basic metrics for different time horizons. Section 4 discusses the results and outlines future research directions.

Data Preparation
To train machine learning models, a large amount of traffic data needs to be collected by detectors installed in the traffic system. As shown in Figure 2, the signalized Intersection 125, located at the junction of Wallensteinstraße and Göttinger Chaussee in Hanover, is selected for this research. It is a complex 4-leg intersection including footpaths, bike lanes, motorways, and tram tracks.

Data Processing
The originally collected data are shown in Table 1, including all signal light states, requests from all traffic groups, detector data, timers, and so on. The table records the data every second for one whole day, giving a total of 86,400 s (rows) and 171 features (columns).

Table 1. Collected data of one day.

In Table 1, only traffic light states are labelled with characters: "R" means the RED signal, "G" means the GREEN signal, "A" means the Attention (YELLOW) signal, "a" is the Acoustic signal for the blind, "S" means Start, and "D" means the Dark signal. To train machine learning models, the data sheet has to be transformed into a fully numerical table. Because only the releasing signals (i.e., "G" and "a") mean that traffic participants are free to pass through the intersection, they are set to 1, while all other states are set to 0. If a detector receives a request for crossing, the corresponding value is set to 1. The fully numerical data sheet after transformation is shown in Table 2.

Table 2. Transformed full numerical data of one day.
Not all features need to be taken into consideration, because all-zero and redundant features not only seriously degrade the forecasting accuracy but also heavily burden computation and training. After feature filtering, the forecasting target in this work is K01R, the signal controlling one motorway in Figure 2, marked by K1Ra and K1Rb. The phase logic installed in the signal controllers was studied to identify the key factors that trigger changes among different phases. These key features are, respectively, the requests to cross ('ANF'), detector data ('BK'), approach timers ('TAN'), emergency requests ('AFS'), and reporting points ('MPN'). After this refinement, the number of features is reduced to 38.
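The encoding of signal states and the removal of all-zero features described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the column names `ANF1` and `UNUSED` and the toy values are hypothetical, and the phase-logic-based selection of the final 38 features is not reproduced here.

```python
# Releasing states: "G" (GREEN) and "a" (acoustic signal) are mapped to 1.
RELEASING_STATES = {"G", "a"}


def encode_state(state):
    """Map a signal-state character to 1 (free to go) or 0 (all other states)."""
    return 1 if state in RELEASING_STATES else 0


def encode_column(states):
    """Encode a whole column of signal-state characters."""
    return [encode_state(s) for s in states]


def drop_all_zero_features(table):
    """Remove columns that contain only zeros.

    `table` maps a column name to a list of numeric values.
    """
    return {name: col for name, col in table.items() if any(v != 0 for v in col)}


# Hypothetical toy data: eight seconds of one signal and two other features.
raw_k01r = ["R", "R", "A", "G", "G", "a", "D", "S"]
table = {
    "K01R": encode_column(raw_k01r),
    "ANF1": [0, 1, 1, 0, 0, 0, 0, 0],  # hypothetical request-to-cross column
    "UNUSED": [0] * 8,                 # hypothetical all-zero column
}
filtered = drop_all_zero_features(table)
```

Only the all-zero filter is automated in this sketch; identifying redundant features would still require knowledge of the controller's phase logic.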

Time Window Generation
In this research, signal prediction is viewed as a time-series problem. The multihorizon direct forecasting method is adopted, which means that a future sequence is forecast directly from an input sequence of historical observations (see [14]). The sequence-to-sequence structure is described in Figure 3. The next step is to define the input sequence and the forecasting sequence, which together are named the Time Window. As shown in Figure 4, the window size is set to 120 s and the forecasting sequence to 30 s. The input historical observations are marked by blue points, and the actual labels are marked by green points. Every time one round of prediction is finished, the time window slides forward by 30 s, until all timestamps are forecast. In this work, 70% of the data is used as the training set, 20% as the validation set, and the remaining 10% for testing.
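The sliding time window and the data split can be sketched as follows. Two points are assumptions rather than statements from the text: the 120 s window is taken to be the input length, with the 30 s forecast following it, and the 70/20/10 split is taken to be chronological. The function names are illustrative.

```python
def make_windows(series, input_len=120, forecast_len=30, stride=30):
    """Cut a per-second series into (inputs, labels) pairs.

    Each pair holds `input_len` observed steps followed by `forecast_len`
    steps to predict; the window then slides forward by `stride` steps.
    """
    windows = []
    last_start = len(series) - input_len - forecast_len
    for start in range(0, last_start + 1, stride):
        inputs = series[start:start + input_len]
        labels = series[start + input_len:start + input_len + forecast_len]
        windows.append((inputs, labels))
    return windows


def split_chronologically(rows, train_pct=70, val_pct=20):
    """Split rows into training, validation and test parts without shuffling."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Toy series of 300 seconds; the values are just the time indices.
series = list(range(300))
windows = make_windows(series)
train, val, test = split_chronologically(list(range(10)))
```

With a stride equal to the forecast length, the label sequences of consecutive windows do not overlap, so each timestamp after the first input window is forecast exactly once.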

Machine Learning Models
Five basic machine learning models are trained for signal prediction: the Baseline Model, Linear Model, Dense Model, Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). Of these, the Baseline Model and the Linear Model serve as benchmark models for comparison with the others. All these models are built with TensorFlow on Google Colab. To train these models, some parameters are defined as:

Baseline Model
The baseline model adopted in this work is the Last Baseline Model. As shown in Figure 5, the predictions are only a repetition of the last seen input time step [15].
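A last-value baseline of this kind needs no training and can be written directly. The sketch below is illustrative; the function names are not taken from the original implementation.

```python
def last_value_baseline(inputs, forecast_len=30):
    """Repeat the last observed value for every step of the forecast."""
    return [inputs[-1]] * forecast_len


def binary_accuracy(y_true, y_pred):
    """Fraction of time steps where the prediction equals the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy example: the signal is 0 at the end of the input and turns 1 later on.
history = [1, 1, 0, 0, 0]
future = [0, 0, 1, 1]
prediction = last_value_baseline(history, forecast_len=len(future))
accuracy = binary_accuracy(future, prediction)
```

Because the signal stays constant for long stretches, such a baseline can already reach a high binary accuracy, which is why it is a useful benchmark.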

Dense Model
The Dense Model is constructed as in Figure 6. It includes one fully connected layer with 512 output units and the ReLU activation function, followed by one dense layer with the sigmoid activation function.

Linear Model
The Linear Model has only a single linear layer and can be viewed as a simplified Dense Model (see Figure 7).

CNN Model
As shown in Figure 8, the CNN Model consists of a one-dimensional convolution layer and one fully connected layer with the sigmoid activation function. The number of output filters in the convolution is 120.

LSTM
LSTM, a kind of Recurrent Neural Network, is viewed as one of the most promising approaches to forecasting future time-series. As shown in Figure 9, the model consists of an LSTM layer with 120 units and a fully connected layer. The internal structure of LSTM and the connections are presented in Figure 10, where X_t is the input feature matrix at the current moment; similarly, X_{t-1} and X_{t+1} are the inputs at the previous and the next moment. The hidden layer outputs are represented by h_{t-1}, h_t, and h_{t+1}. C_t is the internal memory state of the module, which is called the cell state. Normally, LSTM modules are connected in the form of a chain. Each module consists of a forget gate, an update gate, and an output gate. The forget gate is represented by f_t, which decides how much information from the previous state should be forgotten. As described in Equation (1), f_t is a number between 0 and 1, which is calculated from X_t and h_{t-1}, where W is a weight matrix and b is a bias.
The update gate includes two parts, the input update i_t and the candidate cell state g_t. i_t decides how much new information should be updated (see Equation (2)). g_t provides new candidate values that can be updated (see Equation (3)).
A new cell state C_t is generated by combining the forget gate and the update gate (see Equation (4)).
Once the new cell state C_t is obtained, the output gate o_t decides which part of the cell state C_t should be output as the hidden layer output h_t (see Equation (5)).
Finally, the predicted future values of the time-series y_t can be obtained by Equation (6).
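For reference, the gate descriptions in this subsection correspond to the standard LSTM formulation, written out here in LaTeX. Several details are assumptions rather than content recovered from the original: the assignment of the equation numbers (1) to (6), the concatenation notation [h_{t-1}, X_t], the per-gate subscripts on W and b, and the sigmoid output layer in (6), which is chosen by analogy with the Dense and CNN models.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, X_t] + b_f\right) \tag{1}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, X_t] + b_i\right) \tag{2}\\
g_t &= \tanh\left(W_g\,[h_{t-1}, X_t] + b_g\right) \tag{3}\\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t \tag{4}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, X_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t) \tag{5}\\
y_t &= \sigma\left(W_y\,h_t + b_y\right) \tag{6}
\end{align}
```

Here \sigma is the logistic sigmoid, which keeps f_t, i_t, and o_t between 0 and 1, and \odot denotes element-wise multiplication.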

Binary Accuracy
The binary accuracy of these five machine learning models is depicted in Figure 11, where the blue bars mark the validation accuracy and the red bars the test accuracy. Although all models have similar forecasting ability, only those performing better than the benchmark models deserve attention. For a clearer comparison, both test accuracy and validation accuracy are listed in Table 3. LSTM outperforms all the other models, which makes it the preferred choice for signal forecasting.

Table 3. Binary accuracy of machine learning models trained with one-day data (columns: Models, Validation Accuracy, Test Accuracy).

Basic Metrics
As described above, traffic signals are labelled with 0 most of the time. In other words, a high binary accuracy could be obtained simply by always predicting 0. To rule out this case, basic metrics are adopted to further evaluate these models. The related parameters are calculated in Equation (7). The accuracy ACC describes what share of the data is correctly predicted, but when the proportions of ones and zeros are unbalanced, it cannot reflect the real prediction quality. TPR is the True Positive Rate, which describes what share of the actual ones is correctly predicted. PPV (Positive Predictive Value) describes what share of all data predicted as one is actually one. Because of the drawbacks of ACC, the F1 score is calculated as the harmonic mean of TPR and PPV, which better describes the prediction accuracy. MCC is the Matthews correlation coefficient, which ranges between −1 and 1. An MCC of zero usually means totally random predictions, while one means a perfect classifier. The basic metrics of the machine learning models studied are listed in Table 4. As shown in Table 4, even though both the Dense Model and the CNN Model have a high ACC of over 89%, their other metrics are 0, which means that their predictions contain no True Positive values. LSTM outperforms the other models on every metric.
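The metrics discussed above follow from the four confusion-matrix counts (true/false positives and negatives). The sketch below uses the standard definitions; returning 0 when a denominator is zero is a convention assumed here, not one stated in the text.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def basic_metrics(y_true, y_pred):
    """Compute ACC, TPR, PPV, F1 and MCC for binary labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = _safe_div(tp + tn, tp + tn + fp + fn)
    tpr = _safe_div(tp, tp + fn)                      # recall
    ppv = _safe_div(tp, tp + fp)                      # precision
    f1 = _safe_div(2 * ppv * tpr, ppv + tpr)          # harmonic mean of PPV, TPR
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_den)
    return {"ACC": acc, "TPR": tpr, "PPV": ppv, "F1": f1, "MCC": mcc}


# Unbalanced toy labels: a model that always predicts 0 still reaches 90% ACC.
y_true = [0] * 9 + [1]
always_zero = basic_metrics(y_true, [0] * 10)
perfect = basic_metrics(y_true, y_true)
```

The always-zero predictor illustrates the pattern described for the Dense and CNN models: a high ACC while TPR, PPV, F1, and MCC are all 0.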
To further analyze the diagnostic ability of a binary classifier as its discrimination threshold varies, a receiver operating characteristic (ROC) curve is adopted. As shown in Figure 12, the ROC curve is created by plotting the TPR against the False Positive Rate (FPR) at various threshold settings. The Area under the Curve (AUC) is then calculated to evaluate the classification ability. An AUC score of 0.5 means a totally random prediction, while a perfect classifier has an AUC score of one. The ROC curves make it clear that LSTM performs best compared with the other models.
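The construction of an ROC curve and its AUC can be illustrated with a small pure-Python sketch. It sweeps every distinct predicted score as a threshold and integrates with the trapezoidal rule; it assumes that both classes occur in the labels, and the example scores are made up.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # A sample is predicted as 1 when its score is at least the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores: one negative sample is ranked above one positive sample.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_curve_points(labels, scores)
area = auc(points)
```

A classifier that ranks every positive sample above every negative one yields an AUC of 1, which is the behaviour approached by the LSTM in this work.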

LSTM Trained by Data of One Month
Because of the excellent forecasting performance of LSTM on one-day data, it is further trained with one month of data from February. Rather than training directly with the one-month data, the data sheet is divided into three groups to observe the accuracy changes in detail. The training results for 1 day, 10 days, 20 days, and 28 days are depicted in Figure 13. Interestingly, the test accuracy decreases slightly when the data horizon is extended from 1 day to 10 days. A possible cause is that a model trained well on one day (e.g., a workday) may not predict well for another day (e.g., a weekend). Nevertheless, as the data cover longer time horizons, the general prediction accuracy increases, although the accuracy for one month is not significantly improved.

Deviation Calculation
Since these machine learning models are applied to traffic signal prediction, one of the most vital evaluation indicators is the time deviation. Even an error of a few seconds could cause a serious traffic accident, especially at the moment of a signal change. Therefore, the deviations of the LSTM model trained on data of 10 days, 20 days, and 28 days are calculated for comparison. As shown in Figure 14, a violin plot describes the distribution of the deviations. The flatter the shape of the violin plot and the lower the median, the more concentrated and lower the deviations. The maximum deviation is approximately 20 s, probably because the model misses one GREEN signal; it is very hard to guarantee that the trained LSTM catches every future signal change with 100% accuracy. However, the median of the deviations is approximately 2 s, which means that, where deviations exist, 50% of them are below 2 s. Even though deviations cannot be avoided, about 95% of the forecast signals have no deviation. Overall, there is a slight tendency for a longer training data horizon to give a better forecasting quality.

Segment Accuracy Calculation
Another useful evaluation indicator is the segment accuracy. In this work, the forecasting sequence of 30 s is cut into three equal segments to study whether the accuracy is influenced by the length of the forecasting horizon. These segments are defined as follows, where t represents the current time step (in seconds) and t + 1 represents the next second:
• Segment 1: t + 1 -> t + 10;
• Segment 2: t + 11 -> t + 20;
• Segment 3: t + 21 -> t + 30.
The forecasting accuracies for these three segments are calculated for each of the three data horizons. As shown in Figure 15, because each data horizon uses its own test set, the forecasting accuracy fluctuates among the three data horizons. However, it is clear that the closer a segment is to the current time t, the higher its accuracy, with the accuracy decreasing from Segment 1 to Segment 3. As a result, such machine learning models cannot predict a very long future series.
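The per-segment accuracy can be computed by grouping the forecast steps of every window into blocks of 10 s. The following is a minimal sketch with made-up predictions; the function name and the toy data are illustrative only.

```python
def segment_accuracy(y_true_windows, y_pred_windows, segment_len=10):
    """Accuracy per forecast segment, aggregated over all windows.

    Each window is a list of binary values of equal length (e.g., 30 steps);
    step k of the forecast belongs to segment k // segment_len.
    """
    n_segments = len(y_true_windows[0]) // segment_len
    correct = [0] * n_segments
    total = [0] * n_segments
    for y_true, y_pred in zip(y_true_windows, y_pred_windows):
        for step, (t, p) in enumerate(zip(y_true, y_pred)):
            seg = step // segment_len
            if seg >= n_segments:
                break  # ignore trailing steps that do not fill a segment
            correct[seg] += int(t == p)
            total[seg] += 1
    return [c / n for c, n in zip(correct, total)]


# One made-up window: errors become more frequent further into the future.
truth = [[0] * 30]
forecast = [[0] * 10 + [0] * 9 + [1] + [0] * 5 + [1] * 5]
accuracies = segment_accuracy(truth, forecast)
```

In this toy case the accuracies fall from Segment 1 to Segment 3, mirroring the trend reported for Figure 15.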

Basic Metrics of LSTM
Similarly, related basic metrics of LSTM for different time horizons are calculated. As shown in Table 5, with the time horizons expanding, the accuracy increases. For a more intuitive view, the ROC curves of LSTM with different time horizons are depicted in Figure 16.
As shown in Figure 16, the AUC improves significantly when the horizon is extended from 1 day to 10 days. From 10 days to 28 days, however, the difference is much less pronounced. Considering the training time and the complexity of the model, LSTM trained on 10-day data can therefore be a good choice for further use.

Discussion
This paper provides evidence that LSTM performs satisfactorily on time-series forecasting problems, with a test accuracy of over 95%. Further validation is performed by calculating the basic metrics of the models studied, including ACC, PPV, TPR, the F1 score, and MCC, all of which show that the LSTM model outperforms the other compared models for time-series forecasting. Furthermore, the ROC curves of LSTM for different horizons show that LSTM trained on 10 days of data or more achieves a significant improvement in accuracy, while the differences among 10 days, 20 days, and 28 days are not so obvious. Another finding is that the deviations between the forecast sequence and the actual traffic signals deserve more attention, since a number of deviations are still large. In addition, the intersection chosen for this research has no detectors for requests from buses. Therefore, future work should focus on developing a hybrid LSTM model that narrows the deviations down to a reasonable range, and on determining the influence of public transport when it is assigned priority on the road, which directly affects traffic-actuated signals. Another situation that cannot be neglected is how the prediction accuracy changes when an accident or a traffic jam happens. Theoretically, the vehicles can be detected by sensors and fed in as feature values. However, because accident and jam data are lacking for training, the performance of the models under these conditions cannot be verified in this work.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: